## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00
##  Median : 2.200   Median :0.07900   Median :14.00
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00
##  total.sulfur.dioxide    density            pH          sulphates
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000
##     alcohol         quality
##  Min.   : 8.40   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.20   Median :6.000
##  Mean   :10.42   Mean   :5.636
##  3rd Qu.:11.10   3rd Qu.:6.000
##  Max.   :14.90   Max.   :8.000
The summary table suggests that several variables may contain outliers: fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide and sulphates. For residual.sugar, total.sulfur.dioxide and chlorides in particular, the maximum values lie very far above the third quartile.
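To make "very far above the third quartile" concrete, here is a minimal sketch that applies Tukey's 1.5 × IQR rule to the quartiles printed in the summary. The rule is a common convention that I am assuming for illustration; it is not necessarily the criterion used in this report, and the data frame name `stats` is hypothetical.

```r
# Quartiles and maxima copied from the summary() output above.
stats <- data.frame(
  variable = c("residual.sugar", "chlorides", "total.sulfur.dioxide"),
  q1  = c(1.9, 0.070, 22),
  q3  = c(2.6, 0.090, 62),
  max = c(15.5, 0.611, 289)
)

# Tukey's rule (an assumed convention): values above Q3 + 1.5 * IQR
# are flagged as potential outliers.
stats$iqr <- stats$q3 - stats$q1
stats$upper_fence <- stats$q3 + 1.5 * stats$iqr
stats$max_beyond_fence <- stats$max > stats$upper_fence

print(stats)
```

For all three variables the maximum lies well beyond the upper fence, which is consistent with the visual impression from the summary table.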
Most of the red wines in our dataset are rated 5 or 6.
The fixed.acidity and volatile.acidity variables appear roughly normally distributed. In contrast, citric.acid is strongly right-skewed, and neither the log transform nor the sqrt transform changes its shape much.
The residual.sugar and chlorides variables look approximately normally distributed, apart from some outliers in both.
Both free.sulfur.dioxide and total.sulfur.dioxide are skewed to the right and have some outliers. After applying the log transform, they look approximately normally distributed.
The density, pH and sulphates variables look normally distributed. The variance of density is very small: most values fall between 0.993 and 1.
The alcohol variable is right-skewed, and applying the sqrt or log transform does not change it much.
There are 1599 observations with 12 features (13 columns, counting the row index X). The variable quality is discrete and the other variables are continuous.
The main feature of interest in the dataset is quality. I’d like to find out which features have an impact on the quality of red wines.
Based on the information in the doc file provided by the author of the dataset, features such as volatile acidity, citric acid, free sulfur dioxide, total sulfur dioxide and sulphates may be correlated with quality.
No new variables have been created so far.
I applied the log transform to the right-skewed variables (citric.acid, free.sulfur.dioxide, total.sulfur.dioxide and alcohol) to get better insight into their distributions.
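The effect of the log transform can be sketched on the first ten rows shown in the `str()` output. This is only an illustration on a tiny sample, not the analysis of the full dataset; the `skewness` helper is a hypothetical moment-based function defined here, and `log1p()` is shown as one possible workaround for zeros, not as something the report used.

```r
# First ten values copied from the str() output above.
tsd    <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)           # total.sulfur.dioxide
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)       # citric.acid

# Hypothetical helper: moment-based sample skewness (0 for a symmetric sample).
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew_raw <- skewness(tsd)
skew_log <- skewness(log10(tsd))
cat("skewness raw:", round(skew_raw, 3), " after log10:", round(skew_log, 3), "\n")

# citric.acid contains exact zeros, so log10() yields -Inf for those rows;
# log1p(x) = log(1 + x) stays finite (an assumed alternative).
has_zero     <- any(citric == 0)
log_citric   <- log10(citric)
log1p_citric <- log1p(citric)
```

On this sample the log transform reduces the magnitude of the skew in total.sulfur.dioxide. The zeros in citric.acid are worth keeping in mind when reading its log-scale histogram, since those rows cannot be placed on a log axis.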
Since I want to check the correlations between attributes, and the matrix plot alone is not clear enough for that analysis, I’ll add a correlation matrix plot.
According to the matrix plot and the Spearman correlation coefficient matrix, we can see that:

+ The correlation coefficients between quality and alcohol, volatile acidity, citric acid and sulphates are 0.476, -0.391, 0.226 and 0.251 respectively, which means these variables are more strongly correlated with quality than the other variables are.
+ Besides the four variables mentioned above, some variables, including density, total sulfur dioxide, chlorides and fixed acidity, have lower correlation coefficients (absolute value smaller than 0.2) but may still be related to quality.
+ There are also some moderate correlations between variables other than quality: for example, citric acid with fixed acidity, volatile acidity and pH; density with fixed acidity and alcohol; and pH with fixed acidity.
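A correlation matrix of this kind can be computed with base R's `cor()`. The sketch below uses only the first ten rows from the `str()` output, so its coefficients will not match the full-data values quoted above; the data frame name `wine_head` is hypothetical.

```r
# First ten rows copied from the str() output above (illustration only).
wine_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Rank-based (Spearman) correlation matrix; method = "pearson" gives
# the product-moment version instead.
rho <- cor(wine_head, method = "spearman")

# Correlation of every attribute with quality, rounded for readability.
print(round(rho[, "quality"], 3))
```

Reading a single column of the matrix, as in the last line, is often easier than scanning the full plot when only the relationship with quality matters.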
So let’s dig deeper into this.
Before starting the following analysis, I’ll convert quality to a factor, since it takes only 6 discrete values.
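One detail worth illustrating: once quality is a factor, `as.numeric()` returns the internal level codes rather than the original scores. The sketch below, using the first ten rows from the `str()` output and assuming the levels are set to 3 through 8, shows that this shifts the values but leaves the correlation unchanged. The variable names are hypothetical.

```r
# First ten values copied from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Ordered factor with the six quality values present in the dataset
# (setting the levels explicitly is an assumption of this sketch).
quality_f <- factor(quality, levels = 3:8, ordered = TRUE)

codes  <- as.numeric(quality_f)                # level codes: 1 = score 3, ..., 6 = score 8
scores <- as.numeric(as.character(quality_f))  # original scores: 5, 5, 5, 6, ...

# A constant shift does not change a correlation coefficient.
r_codes  <- cor(codes, alcohol)
r_scores <- cor(scores, alcohol)
```

So a call such as `cor.test(as.numeric(quality), alcohol)` gives the same coefficient either way, but any plot axis or summary built from `as.numeric(quality)` would show codes 1 to 6 instead of scores 3 to 8.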
##
## Pearson's product-moment correlation
##
## data: as.numeric(quality) and alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
## [1] "Summary of alcohol:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
From the boxplots of alcohol by quality, red wines with higher quality scores appear to have a higher median alcohol content, if we only consider wines with a quality score above 5. There are also many outliers among wines with a quality of 5, so it is difficult to describe the relationship between alcohol and quality from the boxplots alone. Combined with the scatter plot, however, we can clearly see a positive correlation between the two variables. The correlation is only moderate (r = 0.476, p-value < 0.001), but the very low p-value is strong evidence that it is not due to chance.
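The numbers in the `cor.test()` output for alcohol can be reproduced from r and the degrees of freedom alone. The sketch below uses the standard t statistic for a Pearson correlation and the Fisher z transformation for the confidence interval; only the values r = 0.4761663 and df = 1597 are taken from the output, and the variable names are hypothetical.

```r
# Values copied from the cor.test() output for quality and alcohol.
r  <- 0.4761663
df <- 1597
n  <- df + 2   # Pearson's test uses df = n - 2

# t statistic and two-sided p-value for H0: true correlation = 0.
t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via the Fisher z transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

cat("t =", round(t_stat, 3), "\n")
cat("95% CI:", round(ci, 4), "\n")
```

This makes the role of the sample size visible: with about 1600 observations, even a moderate r produces a very large t statistic, so the small p-value speaks to the existence of a correlation, not to its strength.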
##
## Pearson's product-moment correlation
##
## data: as.numeric(quality) and volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
## [1] "Summary of volatile.acidity:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
##
## Pearson's product-moment correlation
##
## data: as.numeric(quality) and citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
## [1] "Summary of citric.acid:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
##
## Pearson's product-moment correlation
##
## data: as.numeric(quality) and sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
## [1] "Summary of sulphates:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The correlation between quality and sulphates appears slightly positive, and there are many outliers, so I add xlim to the scatter plot to get a better view of the relationship.

+ In the scatter plot, there is an increasing trend while the sulphate level is under 0.9, and quality seems slightly negatively related to sulphates above 1.0.
+ The description of attributes provided by the author says that sulphates is a wine additive which can contribute to sulfur dioxide gas (SO2) levels, and which acts as an antimicrobial and antioxidant. In low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine, which makes our observation reasonable.
+ As in the previous analysis, the low p-value (r = 0.251, p-value < 0.001) indicates that the positive correlation is not due to chance.
For total sulfur dioxide, chlorides and fixed acidity, the relationships with quality are not strong, so we cannot clearly describe their effect on quality. Only for density can we see a slightly negative correlation with quality. However, the variance of density is so small that I doubt it can be detected by our sense of taste. I suspect the relationship appears because the density of the wine depends on the percent alcohol and sugar content, as mentioned in the doc file provided by the author of the dataset.
We can see that pH is negatively related to citric.acid and fixed.acidity, which makes sense because a lower pH means a more acidic solution. In addition, citric.acid is positively correlated with fixed.acidity and negatively correlated with volatile.acidity.
The plots above indicate that density is positively related to fixed.acidity and negatively related to alcohol, which makes sense since the densities of tartaric acid (fixed.acidity), water and alcohol are 1790 kg/m^3, 1000 kg/m^3 and 806 kg/m^3 respectively.
The relationships I observed include the positive correlations of quality with alcohol, citric.acid and sulphates, and the negative correlations of quality with volatile.acidity and density. I added a smoothing line to the scatter plots to help identify the relationships between quality and the other attributes of red wines. Most of the relationships do not appear to be exactly linear.
The citric.acid variable is positively correlated with fixed.acidity and negatively correlated with volatile.acidity. I also find that pH is negatively related to citric.acid and fixed.acidity, and that density is positively related to fixed.acidity and negatively related to alcohol, all of which is chemically reasonable.
The strongest relationship I found is between quality and alcohol, which has the highest r value (0.476) of all the correlations between attributes and quality. The second strongest is between quality and volatile.acidity (r = -0.391). The last two are the correlations of quality with sulphates (r = 0.251) and citric.acid (r = 0.226).
Based on the plot above, we can observe that wines with a quality of 7 and 8 are mostly located in the bottom-right part, compared to the points with a quality of 3 and 4. That means high-quality wines have relatively higher citric acid and lower volatile acidity.
In the scatter plot of alcohol against volatile.acidity colored by quality, points with the same quality are less dispersed in the horizontal dimension than in the first multivariate plot in this section, which is unsurprising since alcohol has a stronger correlation with quality than citric acid has.
At this point, we get a relatively clearer scatter plot in this section. We can see that the points with quality 7 and 8 are mostly located in the upper right of the plot, and most of the points with quality 3, 4 and 5 are in the left part. Points with the same quality are also less dispersed horizontally than vertically. So once again, this supports the finding that alcohol has the strongest correlation with quality.
I combine the top 3 attributes correlated with quality in this plot and then remove the points with a moderate quality of 5 and 6 to get a clearer view of the effect of each attribute on wine quality. The plots support the analyses from the previous part: wines with high quality scores have lower volatile acidity and higher alcohol volume and citric acid content.
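Dropping the moderate scores amounts to a simple row filter. The sketch below applies it to the first ten rows from the `str()` output, which happen to contain only scores 5, 6 and 7, so it is an illustration of the step and not a reproduction of the plot; `wine_head` and `extremes` are hypothetical names.

```r
# First ten rows copied from the str() output above (illustration only).
wine_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Keep only wines whose quality is not a moderate score (5 or 6).
extremes <- subset(wine_head, !(quality %in% c(5, 6)))

# Label the two remaining groups so they can be mapped to color or facets.
extremes$group <- ifelse(extremes$quality >= 7, "high (7-8)", "low (3-4)")

print(extremes)
```

Filtering before plotting, instead of lowering the transparency of the moderate points, keeps the comparison between the two extreme groups uncluttered, at the cost of hiding where the bulk of the wines sit.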
The top three features correlated with quality (alcohol, citric acid and volatile acidity) reinforce each other in our multivariate scatter plots. In short, high-quality wines have relatively higher alcohol volume and citric acid content and relatively lower volatile acidity.
In the scatter plots in this section, I found that points are more dispersed in the citric acid dimension than in the other two (alcohol and volatile acidity), which seems reasonable since, according to the author, citric acid is found in small quantities and can add ‘freshness’ and flavor to wines.
Since there are many outliers in the dataset and the strongest correlation (alcohol with quality) only reaches an r value under 0.5, I could not find a very precise model to predict the quality of wines.
The quality of wine is rated by at least 3 wine experts between 0 (very bad) and 10 (very excellent), and only 6 discrete values (3, 4, 5, 6, 7, 8) appear in our dataset. Most of the scores are 5 and 6, which account for 42.6% and 39.9% of the whole dataset respectively. This means most of the wines are moderate: bad wines (quality 3) and excellent wines (quality 8) account for only 0.63% and 1.1% of our dataset.
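Shares like these can be obtained with `table()` and `prop.table()`. The sketch below runs on the first ten quality values from the `str()` output, so its percentages describe that small sample only and differ from the full-data figures quoted above; the variable names are hypothetical.

```r
# First ten quality values copied from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Fixing the levels to 3..8 keeps scores with zero occurrences in the table.
counts <- table(factor(quality, levels = 3:8))

# Convert counts to percentages of the sample.
shares <- round(100 * prop.table(counts), 1)

print(shares)
```

Declaring all six levels up front means that rare or missing scores still show up as explicit zeros, which makes the imbalance between moderate and extreme wines easy to see.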
The second plot includes three boxplots of attributes (alcohol, volatile acidity and citric acid) by quality; these three attributes have the three strongest correlations with quality. It is easy to see that quality is negatively related to volatile acidity and positively related to alcohol and citric acid. If we focus on the median values of the boxes, we find that the relationships of these three attributes with quality are not exactly linear.
I put all three attributes with the strongest correlation with quality in the third plot, which helps us get clearer insight into the relationships between these attributes. In the left scatter plot there are so many points with a quality of 5 and 6 that they may obscure the relationship between attributes, so I add a plot that keeps only points with relatively extreme quality scores (3, 4, 7, 8) to get a better view. With these two plots, it is easy to see that most points with a quality of 7 and 8 are in the upper-right area of the plots with a smaller point size, and the points with a quality of 3 and 4 show the opposite pattern. That is, high-quality red wines have a relatively higher alcohol volume and citric acid content and a lower volatile acidity, and bad wines show the opposite.
The red wine dataset includes 1599 observations with 11 attributes on the chemical properties of the wine, plus the quality of the wine, which is rated by at least three wine experts between 0 (very bad) and 10 (very excellent). So quality is a subjective variable and the remaining 11 are objective. What interests me most about this dataset is finding which attributes of red wines affect their quality. I followed the guidance provided in the template to start my exploration.
I have to say that it was difficult for me to start the EDA process for the red wines dataset. It is not like the diamonds dataset in the lessons, which we are very familiar with and which has relatively obvious relationships between variables, so that even before starting the analysis we can identify some potentially related variables from intuition and experience. The red wines dataset, however, contains many chemical variables I’m not familiar with. So I took a close look at the doc file provided by the author of the dataset before going deeper into the analysis. This helped a lot and gave me some hints about which attributes might need more attention.
Another problem I faced during the Bivariate and Multivariate Analysis is that the correlations of attributes like alcohol, volatile acidity and citric acid with quality are only weak to moderate. These variables are also correlated with each other, so it is hard to identify the fundamental factors that actually affect the quality of wines based on the given dataset.
Last, I did not create any model to predict the quality of red wines in my EDA process. One reason is what I’ve mentioned above: the correlations between variables are not very strong. The second is that there are no records of wines with quality under 3 or above 8. If we can get a more complete dataset of red wines in the future, it should be easier to build a good model to predict wine quality.